
Conversation

@Emaasit
Contributor

@Emaasit Emaasit commented Jun 5, 2015

Here are more examples on SparkR DataFrames, including creating a Spark context and a SQL context, loading data, and simple data manipulation.

@Emaasit
Contributor Author

Emaasit commented Jun 5, 2015

@shivaram Here is the new submission. I would like to submit a few more examples on statistical modeling and machine learning on SparkR DataFrames.

Contributor

We need to have the Apache License at the top of every file. You can see https://github.com/apache/spark/blob/master/examples/src/main/r/dataframe.R#L1 for an example.

Also, per our style guide, we don't put author names / dates in the file itself, as these are tracked in the commit log.
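
For reference, the standard ASF license header used in Spark's R examples (see the linked dataframe.R) is a block of R comments along these lines:

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#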

@Emaasit
Contributor Author

Emaasit commented Jun 5, 2015

@shivaram I have added the Apache license at the top of every file and removed the author name & date.

Contributor

This comment should probably be 'Load SparkR library into your R session'

Now using sqlContext as the variable name
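
Putting those two review points together, the setup section presumably ends up along these lines (a minimal sketch of the SparkR 1.4-era API; the appName is illustrative):

# Load SparkR library into your R session
library(SparkR)

# Initialize a Spark context and a SQL context; sqlContext is the variable name used below
sc <- sparkR.init(appName = "SparkR-data-manipulation-example")
sqlContext <- sparkRSQL.init(sc)
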
Contributor

This should be describe and not Describe?

Emaasit added 7 commits June 7, 2015 19:31
Provided two options for creating DataFrames. Option 1: from local data frames; Option 2: directly create DataFrames using the read.df function
Deleted the source() function and combined all the code into one file
Deleted the getting started file and combined all the code into one file
Renamed file to data-manipulation.R
@Emaasit
Contributor Author

Emaasit commented Jun 8, 2015

@shivaram I wanted to provide two options for creating DataFrames: one where R users can convert their local data frames into DataFrames, and a second using read.df().
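
A rough sketch of the two options (variable names and the file path are illustrative; read.df assumes the spark-csv data source is available):

# Option 1: convert a local R data frame into a SparkR DataFrame
local_df <- data.frame(name = c("John", "Smith"), age = c(19L, 23L))
df1 <- createDataFrame(sqlContext, local_df)

# Option 2: create the DataFrame directly from the CSV file with read.df
df2 <- read.df(sqlContext, "flights.csv",
               source = "com.databricks.spark.csv", header = "true")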

Contributor

Would read.csv, which is part of base R, also work for this? I know that data.table is more efficient, but I would like to avoid installing new packages in the example.

Replaced the data.table function (fread) with the base R function for reading CSV files (read.csv)
@Emaasit
Contributor Author

Emaasit commented Jun 8, 2015

@shivaram Yes, the base R function works. I have changed it.
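
For illustration, the base R version is simply (file name illustrative):

# Read the CSV into a local R data frame using base R only
flights_df <- read.csv("flights.csv", header = TRUE, stringsAsFactors = FALSE)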

Contributor

Could we take this in as a command line argument? I think something like

args <- commandArgs(trailingOnly = TRUE)
if (length(args) != 1) {
  print("Usage: data-manipulation.R <path-to-flights.csv>")
  print("The data can be downloaded from: https://s3-us-west-2.amazonaws.com/sparkr-data/flights.csv")
  q("no")
}
flightsCsvPath <- args[[1]]

should do the trick

Taking in data set as a command line argument
@Emaasit
Contributor Author

Emaasit commented Jun 9, 2015

@shivaram I fixed that. You will notice that read.csv() does not work well with SSL (i.e., https connections), so I changed the connection to http.

Contributor

This should be sparkRSQL and not SparkRSQL

Contributor

So I tried to run this locally and this step is very slow for the dataset we are using here (I filed https://issues.apache.org/jira/browse/SPARK-8277) due to the way we convert local data frames to lists.

I see two options here: (1) use fewer rows in the example file, so that this runs fast, or (2) use a different dataset to demonstrate creating a SparkR DataFrame from a local data frame (the CSV reader is fine).

Let me know which you think is better.

To create a SparkR DataFrame, I used fewer rows of the local data frame.
@Emaasit
Contributor Author

Emaasit commented Jun 10, 2015

@shivaram To create a Spark DataFrame from a local data frame, I used a subset of the data with fewer rows.
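
A sketch of that workaround (the column and filter value are illustrative; the point is that only a small local data frame is passed to createDataFrame):

# Convert only a small subset of rows to a SparkR DataFrame so this step stays fast
SFO_df <- flights_df[flights_df$dest == "SFO", ]
SFO_DF <- createDataFrame(sqlContext, SFO_df)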

Contributor

This line should also go inside the if block

@shivaram
Contributor

Jenkins, ok to test

Contributor

The source here needs to be com.databricks.spark.csv

BTW @rxin, is there some way we can map source = csv to that automatically?

Contributor

Not if csv is outside this ... maybe we can provide a way for data sources to register short names.

@shivaram
Contributor

Thanks @Emaasit for the update. I just had a few more things that I ran into while executing the example. Also you can verify some of these things by just running the example on your machine -- I just used a command of the form

./bin/sparkR --packages com.databricks:spark-csv_2.10:1.0.3 examples/src/main/r/data-manipulation.R ./flights.csv

to check things

@SparkQA

SparkQA commented Jun 10, 2015

Test build #34619 has finished for PR 6668 at commit 3a97867.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Emaasit
Contributor Author

Emaasit commented Jun 10, 2015

@shivaram Ok. Got you.

@shivaram
Contributor

shivaram commented Jul 6, 2015

LGTM. Thanks @Emaasit for this PR. There are some outstanding comments, but I'll fix them during the merge.

@Emaasit
Contributor Author

Emaasit commented Jul 6, 2015

Thanks @shivaram.

@asfgit asfgit closed this in 293225e Jul 6, 2015